
fix for bug when fused pq is used with no hierarchy #664

Merged: MarkWolters merged 1 commit into compaction-pr from fix/compaction-pr-hierarchy-fix on May 7, 2026

Conversation

@MarkWolters
Contributor

The Bug

CompactWriter.writeFooter() is the counterpart of AbstractGraphIndexWriter.writeSparseLevels(). Both methods write the "source feature" block that sits between the L0 node data and
the actual footer bytes. OnDiskGraphIndex.loadInMemoryFeatures() reads this block to populate hierarchyCachedFeatures, which FusedPQDecoder.similarityTo() uses.

writeSparseLevels() has two branches:

  • Hierarchy enabled (getMaxLevel >= 1): writes a PQ code for every level-1 node
  • Hierarchy disabled (getMaxLevel == 0): writes one entry — the entry node's own PQ code
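The two branches can be modeled with a minimal sketch (simplified, hypothetical types and signatures, not the actual jvector API):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of writeSparseLevels(): with a hierarchy, one PQ code is
// written per level-1 node; without one, a single entry -- the entry node's
// own PQ code -- is written instead. Names here are illustrative.
public class SparseLevelsSketch {
    record Entry(int node, byte[] pqCode) {}

    static List<Entry> writeSparseLevels(int maxLevel,
                                         List<Entry> level1Records,
                                         Entry entryNodeRecord) {
        List<Entry> block = new ArrayList<>();
        if (maxLevel >= 1) {
            block.addAll(level1Records);   // hierarchy enabled
        } else {
            block.add(entryNodeRecord);    // hierarchy disabled: exactly one entry
        }
        return block;
    }

    public static void main(String[] args) {
        Entry entry = new Entry(0, new byte[] {1, 2});
        // No hierarchy: the block still contains one record, keyed on node 0.
        List<Entry> block = writeSparseLevels(0, List.of(), entry);
        System.out.println(block.size());        // 1
        System.out.println(block.get(0).node()); // 0
    }
}
```

The key invariant is that the block is never empty: the reader side always expects at least one record to be present.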

CompactWriter.writeFooter() had only the first branch (via level1FeatureRecords). With enableHierarchy: false, level1FeatureRecords is always empty, so nothing gets written. When
loadInMemoryFeatures then tries to read one entry from what it thinks is that block, it consumes bytes from the footer header instead — populating hierarchyCachedFeatures with one
garbage entry keyed on a non-zero integer. When GraphSearcher.initializeInternal calls scoreFunction.similarityTo(entry.node) (line 331), entry.node is 0 (the first remapped ordinal),
which is not in the garbage-keyed map, so the search fails with "Node 0 is not in the hierarchy".

The Fix (2 files)

  • CompactWriter.java: Added entryNode and entryNodePqCode fields, a setEntryNodePqCode() setter, and an else if (entryNodePqCode != null) branch in writeFooter() that writes exactly
    what AbstractGraphIndexWriter writes in its no-hierarchy branch.
  • OnDiskGraphIndexCompactor.java: Refactored resolveEntryNode() into resolveEntryNodeSource() (returns {sourceIdx, originalOrdinal}), and in compact(), after compactLevels()
    completes, reads the entry node's vector from its source graph, encodes it with the retrained PQ, and passes it to the writer before calling writeFooter().
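A hedged sketch of the added branch (field and method names are illustrative, not the actual CompactWriter code):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

// Hypothetical model of the fix: the writer gains an entry-node PQ code that
// is written when there are no level-1 records (hierarchy disabled), so the
// source-feature block always contains at least one entry for the reader.
class CompactWriterSketch {
    record FeatureRecord(int node, byte[] pqCode) {}

    private List<FeatureRecord> level1FeatureRecords = List.of();
    private int entryNode = -1;
    private byte[] entryNodePqCode;

    void setEntryNodePqCode(int entryNode, byte[] pqCode) {
        this.entryNode = entryNode;
        this.entryNodePqCode = pqCode;
    }

    byte[] writeFooter() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        if (!level1FeatureRecords.isEmpty()) {
            for (FeatureRecord r : level1FeatureRecords) {  // hierarchy enabled
                out.writeInt(r.node());
                out.write(r.pqCode());
            }
        } else if (entryNodePqCode != null) {               // the added branch:
            out.writeInt(entryNode);                        // write the entry
            out.write(entryNodePqCode);                     // node's own PQ code
        }
        // ...actual footer bytes would follow here...
        return bytes.toByteArray();
    }
}
```

With enableHierarchy: false, the else-if branch now writes one record keyed on the entry node, so loadInMemoryFeatures reads a valid entry instead of consuming footer-header bytes.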

@github-actions
Contributor

github-actions Bot commented May 7, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

@dian-lun-lin

LGTM

@MarkWolters MarkWolters marked this pull request as ready for review May 7, 2026 20:40
@MarkWolters MarkWolters merged commit 632bc76 into compaction-pr May 7, 2026
9 checks passed
@MarkWolters MarkWolters deleted the fix/compaction-pr-hierarchy-fix branch May 7, 2026 20:40
eolivelli added a commit to eolivelli/jvector that referenced this pull request May 9, 2026
* Add on-disk graph index compaction algorithm

Introduce OnDiskGraphIndexCompactor and PQRetrainer for streaming N:1
merging of on-disk HNSW indexes without full in-memory materialization.
Supports deletion filtering via live-node bitsets, custom ordinal
mapping, and PQ codebook retraining.

* Add compaction unit tests

Tests for OnDiskGraphIndexCompactor covering basic compaction, deletions,
ordinal remapping, multi-source merging, and FusedPQ compaction scenarios.

* Add reporting and storage infrastructure for CompactorBenchmark

Add JFR recording, system stats collection, JSONL logging, git info
capture, thread allocation tracking, dataset partitioning, and cloud
storage layout utilities used by CompactorBenchmark. Switch
jvector-examples logging from logback to log4j2 for consistency with
benchmarks-jmh and to avoid duplicate SLF4J bindings in the fat jar.

* Add CompactorBenchmark and tooling

JMH-based benchmark with configurable workload modes (PARTITION_AND_COMPACT,
PARTITION_ONLY, COMPACT_ONLY, BUILD_FROM_SCRATCH), recall measurement, JFR
recording, and JSONL result logging. Includes BenchmarkParamCounter for
progress tracking, EventLogAnalyzer for post-run analysis, GHA workflow,
and exec-maven-plugin integration. Add forced vectorization provider
property to VectorizationProvider for benchmark reproducibility.

* Update build config and project metadata for compaction

Add result file patterns to .gitignore, update rat-excludes for the new
compaction workflow and catalog cache files.

* Fix JMH jar selection in run-compaction.yml

The benchmarks-jmh-*.jar glob matched the -javadoc jar first, which has
no Main-Class. Select the shaded JMH jar explicitly by excluding
-javadoc and -sources jars.
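The glob problem and its fix can be sketched in a few lines of shell (jar names are assumed for illustration):

```shell
# Sketch: the benchmarks-jmh-*.jar glob also matches the -javadoc and
# -sources jars; filter them out to select only the shaded JMH jar.
cd "$(mktemp -d)"
touch benchmarks-jmh-4.0.jar benchmarks-jmh-4.0-javadoc.jar benchmarks-jmh-4.0-sources.jar
jar=$(ls benchmarks-jmh-*.jar | grep -v -e '-javadoc' -e '-sources')
echo "$jar"   # benchmarks-jmh-4.0.jar
```

Note that plain `ls` sorts `-javadoc` before `.jar`, which is why the unfiltered glob picked up the wrong jar first.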

* Fix CompactorBenchmark invocation in run-compaction.yml

Use -cp with CompactorBenchmark.main() instead of -jar with JMH Main
to avoid BenchmarkList discovery issues in CI's shaded jar.

* Address PR review feedback

- Extract CompactWriter into its own file to reduce OnDiskGraphIndexCompactor size
- Rewrite SystemStatsCollector to read /proc files directly in Java instead of spawning bash
- Clarify recall section description in docs/compaction.md

* Fix benchmark invocation in docs and default dataset

Use -cp instead of -jar in docs since the benchmarks-jmh-*.jar glob
matches the -javadoc jar first. Change default dataset from
glove-100-angular to ada002-100k. Note -Xmx should be adjusted to
fit the dataset.

* Fix jar selection: use fixed output name compactor-benchmark.jar

The benchmarks-jmh-*.jar glob expands to multiple jars (shaded +
javadoc), causing -cp to misinterpret the second jar as the main
class. Configure shade plugin outputFile to produce a fixed
compactor-benchmark.jar name. Update docs and CI workflow.
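A sketch of the relevant shade-plugin configuration (coordinates and path are assumed, not copied from the project pom):

```xml
<!-- Sketch: give the shaded jar a fixed name so docs and CI can reference
     it directly instead of relying on a glob over target/. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <outputFile>${project.build.directory}/compactor-benchmark.jar</outputFile>
  </configuration>
</plugin>
```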

* Refactor workload modes and fix build-from-scratch timing

Simplify WorkloadMode enum: PARTITION_ONLY/COMPACT_ONLY/COMPACT_AND_RECALL/
BUILD_FROM_SCRATCH collapsed into PARTITION/COMPACT/BUILD/PARTITION_AND_COMPACT
plus a separate measureRecall flag.

Fix buildFromScratch timing to include PQ computation and graph
construction (previously only timed the write step).

Add fair comparison guidelines to CompactorBenchmark.md.

* Add TIERED_10_90 and TIERED_1_99 split distributions

Support 10%/90% and 1%/99% partition splits for benchmarking compaction
of a small new segment into a large existing index. Add split
distribution reference table to CompactorBenchmark.md.

* fix for bug when fused pq is used with no hierarchy (datastax#664)

---------

Co-authored-by: dian-lun-lin <cyc4542000@gmail.com>
Co-authored-by: Mark Wolters <mwolters138@gmail.com>